Genome Sequencing in Open Microfabricated High Density Picoliter Reactors
نویسندگان
چکیده
We describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel 60×60 mm2 fibreoptic slide containing 1,600,000 individual wells and is able to sequence 25 million bases, at 99% or better accuracy (phred 20), in a 4 hour run. To provide sequencing templates, we clonally amplify DNA fragments on beads in the droplets of an emulsion. The template-carrying beads are loaded into the wells to convert each into a picoliter-scale sequencing reactor. We perform sequencing by synthesis using a pyrosequencing protocol optimized for solid support and the small dimension of the open reactors. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembling the Mycoplasma genitalium genome with 96% coverage at 99.96 % accuracy in one run of the machine. DNA sequencing has dramatically changed the nature of biomedical research and medicine. Reductions in the cost, complexity and time required to sequence large amount of DNA, including improvements in the ability to sequence bacterial and eukaryotic genomes will have significant scientific, economic and cultural impact. Large scale sequencing projects, including Correspondence and requests for materials should be addressed to J. M. Rothberg ([email protected]). Sequences for M. genitalium and S. pneumoniae were deposited at DDBJ/EMBL/GenBank under accession numbers AAGX01000000 and AAGY01000000 respectively. *These authors contributed equally to the work. Supplementary Information accompanies the paper on www.nature.com/nature. N IH PA Athor M anscript N IH PA Athor M anscript N IH PA Athor M anscript Margulies et al. Page 2 of 14 whole genome sequencing, have usually required the cloning of DNA fragments into bacterial vectors, amplification and purification of individual templates, followed by Sanger sequencing 1 using fluorescent chain-terminating nucleotide analogues 2 and either slab gel or capillary electrophoresis. Current estimates put the cost of sequencing a human genome between $10 and $25 million 3. Alternative sequencing methods have been described 4, 5, 6, 7, 8 however, no technology has displaced the use of bacterial vectors and Sanger sequencing as the main generators of sequence information. In this paper we describe an integrated system whose throughput routinely enables applications requiring millions of bases of sequence information, including whole genome sequencing Our focus has been on the co-development of an emulsion-based method 9, 10, 11 to isolate and amplify DNA fragments in vitro, and of a fabricated substrate and instrument that performs pyrophosphate-based sequencing (“pyrosequencing” 5, 12) in picoliter-sized wells. In a typical run we generate over 25 million bases with a phred 20 or better quality score (predicted to have an accuracy of 99% or higher). While this phred 20 quality throughput is significantly higher than that of Sanger sequencing by capillary electrophoresis, it is currently at the cost of substantially shorter reads and lower average individual read accuracy 13. We further characterize the performance of the system, and demonstrate that it is possible to assemble bacterial genomes de novo from relatively short reads, by sequencing a known bacterial genome, Mycoplasma genitalium (580 kbp), and comparing our shotgun sequencing and de novo assembly with the results originally obtained for this genome 14. The results of shotgun sequencing and de novo assembly of a larger bacterial genome, Streptococcus pneumoniae 15 (2.1 Mbp), are presented in Supplementary Table 4. Emulsion based sample preparation We generate random libraries of DNA fragments by shearing an entire genome and isolating single DNA molecules by limiting dilution (Supplementary Methods: Library Preparation). Specifically, we randomly fragment the entire genome, add specialized common adapters to the fragments, capture the individual fragments on their own beads and, within the droplets of an emulsion, clonally amplify the individual fragment (Figure 1A and 1B). Unlike in current sequencing technology, our approach does not require subcloning in bacteria or the handling of individual clones; the templates are handled in bulk within the emulsions 9, 10, 11. Sequencing in fabricated picoliter sized reaction vessels We perform sequencing by synthesis simultaneously in open wells of a fibreoptic slide using a modified pyrosequencing protocol that is designed to take advantage of the small scale of the wells. The fibreoptic slides are manufactured by slicing of a fibreoptic block that is obtained by repeated drawing and fusing of optic fibres. At each iteration, the diameters of the individual fibres decrease as they are hexagonally packed into bundles of increasing cross-sectional sizes. Each fibreoptic core is 44 μm in diameter and surrounded by 2–3 μm of cladding; etching of each core creates reaction wells approximately 55 μm in depth with a centre-to-centre distance of 50 μm (Figure 1C), resulting in a calculated well size of 75 pL and a well density of 480 wells/mm2. The slide, containing approximately 1.6 million wells16, is loaded with beads and mounted in a flow chamber designed to create a 300 μm high channel, above the well openings, through which the sequencing reagents flow (Figure 2, A and B). The unetched base of the slide is in optical contact with a second fibreoptic imaging bundle bonded to a CCD sensor, allowing the capture of emitted photons from the bottom of each individual well (Figure 2, C, and Supplementary Methods: Imaging System). We developed a three-bead system, and optimized the components to achieve high efficiency on solid support. The combination of picoliter-sized wells, enzyme loading uniformity allowed Nature. Author manuscript; available in PMC 2006 September 15. N IH PA Athor M anscript N IH PA Athor M anscript N IH PA Athor M anscript Margulies et al. Page 3 of 14 by the small beads and enhanced solid support chemistry enabled us to develop a method that extends the useful read length of sequencing-by-synthesis to 100 bp (Supplementary Methods: Sequencing). In the flow-chamber cyclically delivered reagents flow perpendicularly to the wells. This configuration allows simultaneous extension reactions on template carrying beads within the open wells and relies on convective and diffusive transport to control the addition or removal of reagents and by-products. The time scale for diffusion into and out of the wells is on the order of 10 seconds in the current configuration and is dependent on well depth and flow channel height. The time scales for the signal-generating enzymatic reactions are on the order of 0.02–1.5 seconds (Supplementary Methods: Interwell Diffusion). The current reaction is dominated by mass transport effects and improvements based on faster delivery of reagents are possible. Well depth was selected based on a number of competing requirements: (i) wells need to be deep enough for the DNA-carrying beads to remain in the wells in the presence of convective transport past the wells, (ii) they must be sufficiently deep to provide adequate isolation against diffusion of by-products from a well in which incorporation is taking place to a well where no incorporation is occurring, and (iii) they must be shallow enough to allow rapid diffusion of nucleotides into the wells, and rapid washing out of remaining nucleotides at the end of each flow cycle to enable high sequencing throughput and reduced reagent use. Following the flow of each nucleotide, a wash containing a nuclease is used to ensure that nucleotides do not remain in any well prior to the next nucleotide being introduced. Base Calling of Individual Reads Nucleotide incorporation is detected by the associated release of inorganic pyrophosphate (PPi) and the generation of photons 5, 12. Wells containing template-carrying beads are identified by detecting a known four-nucleotide “key” sequence at the beginning of the read (Supplementary Methods: Image Processing). Raw signals are background-subtracted, normalized and corrected. The normalized signal intensity at each nucleotide flow, for a particular well, indicates the number of nucleotides, if any, that were incorporated. This linearity in signal is preserved to at least homopolymers of length 8 (Supplementary Figure 6). In sequencing by synthesis a very small number of templates on each bead lose synchronism (i.e. either get ahead of, or fall behind, all other templates in sequence 17). The effect is primarily due to leftover nucleotides in a well (creating “carry forward”) or to incomplete extension. Typically, we observe a carry forward rate of 1–2% and an incomplete extension rate of 0.1–0.3%. Correction of these shifts is essential because the loss of synchronism is a cumulative effect that degrades the quality of sequencing at longer read lengths. We have developed algorithms, based on detailed models of the underlying physical phenomena, that allow us to determine, and correct for, the amounts of carry forward and incomplete extension occurring in individual wells (Supplementary Methods: Signal Processing). Figure 3 shows the processed result, a 113 bp long read generated in the M. genitalium run discussed below. To assess sequencing performance and the effectiveness of the correction algorithms, independently of artifacts introduced during the emulsion-based sample preparation, we created test fragments with difficult-to-sequence stretches of identical bases of increasing length (homopolymers) (Supplementary Methods: Test Fragments and Supplementary Figure 4). Using these test fragments, we have verified that at the individual read level we achieve base call accuracy of approximately 99.4%, at read lengths in excess of 100 bp (Table 1). High Quality Reads and Consensus Accuracy. Prior to base calling or aligning reads, we select high quality reads without relying on a priori knowledge of the genome or template being sequenced (Supplementary Methods: High Quality Reads). This selection is based on the observation that poor quality reads have a high Nature. Author manuscript; available in PMC 2006 September 15. N IH PA Athor M anscript N IH PA Athor M anscript N IH PA Athor M anscript Margulies et al. Page 4 of 14 proportion of signals that do not allow a clear distinction between a flow during which no nucleotide was incorporated and a flow during which one or more nucleotide was incorporated. When base calling individual reads, errors can occur because of signals that have ambiguous values (Supplementary Figure 5). To improve the usability of our reads, we also developed a metric which allows us to estimate ab initio the quality (or probability of correct base call) of each base of a read, analogous to the phred score 18 used by current Sanger sequencers (Supplementary Methods: Quality Scores and Supplementary Figure 8). Higher quality sequence can be achieved by taking advantage of the high oversampling that our system affords and building a consensus sequence. Sequences are aligned to one another using the signal strengths at each nucleotide flow, rather than individual base calls, to determine optimal alignment (Supplementary Methods: Flow-space Mapping, Consensus Accuracy and Genome Coverage). The corresponding signals are then averaged, after which base calling is performed. This approach greatly improves the accuracy of the sequence (Supplementary Figure 7), and provides an estimate of the quality of the consensus base. We refer to that quality measure as the Z-score; it is a measure of the spread of signals in all the reads at one location and the distance between the average signal and the closest base calling threshold value. In both re-sequencing and de novo sequencing, as the minimum Z-score is raised the consensus accuracy increases, while coverage decreases; approximately half of the excluded bases, as the Z-score is increased, belong to homopolymers of length 4 and larger. Sanger sequencers usually require a depth of coverage at any base of three or more in order to achieve a consensus accuracy of 99.99%. To achieve a minimum of three fold coverage of 95% of the unique portions of a typical genome requires approximately 7 to 8 fold oversampling. Due to our higher error rate, we have observed that comparable consensus accuracies, over a similar fraction of a genome, are achieved with a depth of coverage of 4 or more, requiring approximately 10–12x oversampling. Mycoplasma genitalium (580,069 bp). Mycoplasma genomic DNA was fragmented and prepared into a sequencing library as described above. (This was accomplished by a single individual in 4 hours.) Following emulsion PCR and bead deposition onto a 60×60 mm2 fibreoptic slide, a process which took one individual 6 hours, 42 cycles of 4 nucleotides were flowed through the sequencing system in an automated 4 hour run of the instrument. The results are summarized in Table 2. In order to measure the quality of individual reads, we aligned each High Quality Read to the reference genome at 70% stringency, using flow-space mapping and criteria similar to those used previously in assessing the accuracy of other base callers 18. When assessing sequencing quality, only reads that mapped to unique locations in the reference genome were included. Since this process excludes repeat regions (parts of the genome whose corresponding flowgrams are 70% similar to one another), the selected reads did not cover the genome completely. Figure 4A illustrates the distribution of read lengths for this run. The average read length was 110 bp, the resulting oversample 40 fold, and 84,011 reads (27.4%) were perfect. Figure 4B summarizes the average error as a function of base position. Coverage of non-repeat regions was consistent with the sample preparation and emulsion not being biased (Supplementary Figure 8). At the individual read level, we observe an insertion and deletion error rate of approximately 3.3%; substitution errors have a much lower rate, on the order of 0.5%. When using these reads without any Z-score restriction, we covered 99.94% of the genome in 10 contiguous regions with a consensus accuracy of 99.97%. The error rate in homopolymers is significantly reduced in the consensus sequence (Supplementary Figure 7). Of the bases not covered by this consensus sequence (366 bp), all belonged to excluded repeat regions. Setting a minimum Z-score equal to 4, coverage was reduced to 98.1% of the genome, while consensus accuracy increased to 99.996%. We further demonstrated the reproducibility of the system by repeating the whole genome sequencing of M. genitalium an additional 8 Nature. Author manuscript; available in PMC 2006 September 15. N IH PA Athor M anscript N IH PA Athor M anscript N IH PA Athor M anscript Margulies et al. Page 5 of 14 times, achieving a 40 fold coverage of the genome in each of the 8 separate instrument runs (Supplementary Table 3). We assembled the M. genitalium reads from a single run into 25 contigs with an average length of 22.4 kbp. One of these contigs was misassembled due to a collapsed tandem repeat region of 60 bp, and was corrected by hand. The original sequencing of M. genitalium resulted in 28 contigs prior to directed sequencing used for finishing the sequence 14. Our assembly covered 96.54% of the genome and attained a consensus accuracy of 99.96%. Non-resolvable repeat regions amount to 3% of the genome: we therefore covered 99.5% of the unique portions of the genome. Sixteen of the breaks between contigs were due to non-resolvable repeat regions, 2 were due to missed overlapping reads (our read filter and trimmer are not perfect and the algorithms we use to perform the pattern matching of flowgrams occasionally misses valid overlaps), and the remainder to thin read coverage. Setting a minimum Z-score of 4, coverage was reduced to 95.27% of the genome (98.2% of the resolvable part of the genome) with the consensus accuracy increasing to 99.994%.
منابع مشابه
Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing.
An efficient, nanoliter-scale microfabricated bioprocessor integrating all three Sanger sequencing steps, thermal cycling, sample purification, and capillary electrophoresis, has been developed and evaluated. Hybrid glass-polydimethylsiloxane (PDMS) wafer-scale construction is used to combine 250-nl reactors, affinity-capture purification chambers, high-performance capillary electrophoresis cha...
متن کاملMultilayer High-density 3d Nanowell Arrays for Digital Biology
Digital biology based droplets and micro/nanowells are useful as chemical and bio-chemical reactors discretizing reagents into picoliter to nanoliter sample volumes for individual analysis of cells, organisms, or molecules. Presented in this work is a multilayer microwell design to achieve three-dimensional, 3D, nanowell patterns for high-density imaging arrays. Using this technique, as much as...
متن کاملGate-controlled Microfluidic Chamber with Magnetic Bead for Dna Sequencing-by-synthesis Technology
In this paper we present a novel microfluidic platform for DNA sequencing-by-synthesis methods (e.g. pyrosequencing). The proposed platform is based on the valve-controllable PDMS channel technology with DNA-coated magnetic beads. The encapsulation of the reaction of DNA polymerization in picoliter-sized wells provides for excellent isolation and control for detection. This separation prevents ...
متن کاملI-20: Towards The Transparent Embryo: Dynamics and Ethics of Comprehensive Preimplantation Genetic Screening
Background: To study the ethical aspects of comprehensive preimplantation genetic screening (PGS) through microarrays and whole genome sequencing Materials and Methods: In order to pinpoint ethical issues regarding comprehensive embryo screening we have first investigated the technical and moral issues by organizing a campus meeting with experts and by a literature study. Subsequently we have i...
متن کاملI-37: Establishing High Resolution Genomic Profiles of Single Cells Using Microarray and Next-Generation Sequencing Technologies
The nature and pace of genome mutation is largely unknown. Standard methods to investigate DNA-mutation rely on arraying or sequencing DNA from a population of cells, hence the genetic composition of individual cells is lost and de novo mutation in cell(s) is concealed within the bulk signal. We developed methods based on (SNP-) arraying and next-generation sequencing of single-cell whole-genom...
متن کاملProduction of an economic high-density concrete for shielding megavoltage radiotherapy rooms and nuclear reactors
Background: In megavoltage radiotherapy rooms, ordinary concrete is usually used due to its low construction costs, although higher density concrete are sometimes used, as well. The use of high-density concrete decreases the required thickness of the concrete barrier hence, its disadvantage is its high cost. In a nuclear reactor, neutron radiation is the most difficult to shield. A method for p...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2005